Heart disease is a broad term that refers to a range of conditions that affect the heart's structure and function. It encompasses various conditions, including coronary artery disease, heart rhythm disorders (arrhythmias), heart valve defects, congenital heart defects, and others. Heart disease can develop over time due to factors such as high blood pressure, high cholesterol, smoking, diabetes, obesity, sedentary lifestyle, and family history.
A heart attack, on the other hand, is a specific medical event that occurs when blood flow to a part of the heart is blocked or severely reduced, leading to damage or death of the heart muscle tissue. This blockage usually occurs due to the rupture of a plaque (a buildup of cholesterol and other substances) in the coronary arteries, which supply oxygen-rich blood to the heart muscle. When the blood flow is interrupted, the affected part of the heart muscle is deprived of oxygen and nutrients, causing tissue damage.
In summary:
Heart disease is a general term that refers to various conditions affecting the heart. A heart attack is a specific event that occurs when blood flow to a part of the heart is blocked, leading to damage or death of the heart muscle tissue. Heart disease can increase the risk of experiencing a heart attack, but not all heart disease patients will necessarily have a heart attack. There are various types of heart disease, and each may have different symptoms and treatment approaches.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import statsmodels.api as sm
import numpy as np
#!pip install interpret
import interpret
from interpret.glassbox import LogisticRegression
from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from interpret.glassbox import ExplainableBoostingClassifier
from interpret import show
#!pip install causalinference
from causalinference import CausalModel
from sklearn.linear_model import LogisticRegression  # note: shadows interpret.glassbox.LogisticRegression imported above
from sklearn.neighbors import NearestNeighbors
import plotly.express as px
from pywaffle.waffle import Waffle
import shutil
from os import path
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)  # seaborn/pandas emit many FutureWarnings on this stack
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE
df = pd.read_csv('heart_2022_no_nans.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246022 entries, 0 to 246021
Data columns (total 40 columns):
 #   Column                     Non-Null Count   Dtype
---  ------                     --------------   -----
 0   State                      246022 non-null  object
 1   Sex                        246022 non-null  object
 2   GeneralHealth              246022 non-null  object
 3   PhysicalHealthDays         246022 non-null  float64
 4   MentalHealthDays           246022 non-null  float64
 5   LastCheckupTime            246022 non-null  object
 6   PhysicalActivities         246022 non-null  object
 7   SleepHours                 246022 non-null  float64
 8   RemovedTeeth               246022 non-null  object
 9   HadHeartAttack             246022 non-null  object
 10  HadAngina                  246022 non-null  object
 11  HadStroke                  246022 non-null  object
 12  HadAsthma                  246022 non-null  object
 13  HadSkinCancer              246022 non-null  object
 14  HadCOPD                    246022 non-null  object
 15  HadDepressiveDisorder      246022 non-null  object
 16  HadKidneyDisease           246022 non-null  object
 17  HadArthritis               246022 non-null  object
 18  HadDiabetes                246022 non-null  object
 19  DeafOrHardOfHearing        246022 non-null  object
 20  BlindOrVisionDifficulty    246022 non-null  object
 21  DifficultyConcentrating    246022 non-null  object
 22  DifficultyWalking          246022 non-null  object
 23  DifficultyDressingBathing  246022 non-null  object
 24  DifficultyErrands          246022 non-null  object
 25  SmokerStatus               246022 non-null  object
 26  ECigaretteUsage            246022 non-null  object
 27  ChestScan                  246022 non-null  object
 28  RaceEthnicityCategory      246022 non-null  object
 29  AgeCategory                246022 non-null  object
 30  HeightInMeters             246022 non-null  float64
 31  WeightInKilograms          246022 non-null  float64
 32  BMI                        246022 non-null  float64
 33  AlcoholDrinkers            246022 non-null  object
 34  HIVTesting                 246022 non-null  object
 35  FluVaxLast12               246022 non-null  object
 36  PneumoVaxEver              246022 non-null  object
 37  TetanusLast10Tdap          246022 non-null  object
 38  HighRiskLastYear           246022 non-null  object
 39  CovidPos                   246022 non-null  object
dtypes: float64(6), object(34)
memory usage: 75.1+ MB
pd.set_option('display.max_columns', None)
df.head()
| | State | Sex | GeneralHealth | PhysicalHealthDays | MentalHealthDays | LastCheckupTime | PhysicalActivities | SleepHours | RemovedTeeth | HadHeartAttack | HadAngina | HadStroke | HadAsthma | HadSkinCancer | HadCOPD | HadDepressiveDisorder | HadKidneyDisease | HadArthritis | HadDiabetes | DeafOrHardOfHearing | BlindOrVisionDifficulty | DifficultyConcentrating | DifficultyWalking | DifficultyDressingBathing | DifficultyErrands | SmokerStatus | ECigaretteUsage | ChestScan | RaceEthnicityCategory | AgeCategory | HeightInMeters | WeightInKilograms | BMI | AlcoholDrinkers | HIVTesting | FluVaxLast12 | PneumoVaxEver | TetanusLast10Tdap | HighRiskLastYear | CovidPos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Alabama | Female | Very good | 4.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 9.0 | None of them | No | No | No | No | No | No | No | No | Yes | No | No | No | No | No | No | No | Former smoker | Never used e-cigarettes in my entire life | No | White only, Non-Hispanic | Age 65 to 69 | 1.60 | 71.67 | 27.99 | No | No | Yes | Yes | Yes, received Tdap | No | No |
| 1 | Alabama | Male | Very good | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 6.0 | None of them | No | No | No | No | No | No | No | No | Yes | Yes | No | No | No | No | No | No | Former smoker | Never used e-cigarettes in my entire life | No | White only, Non-Hispanic | Age 70 to 74 | 1.78 | 95.25 | 30.13 | No | No | Yes | Yes | Yes, received tetanus shot but not sure what type | No | No |
| 2 | Alabama | Male | Very good | 0.0 | 0.0 | Within past year (anytime less than 12 months ... | No | 8.0 | 6 or more, but not all | No | No | No | No | No | No | No | No | Yes | No | No | Yes | No | Yes | No | No | Former smoker | Never used e-cigarettes in my entire life | Yes | White only, Non-Hispanic | Age 75 to 79 | 1.85 | 108.86 | 31.66 | Yes | No | No | Yes | No, did not receive any tetanus shot in the pa... | No | Yes |
| 3 | Alabama | Female | Fair | 5.0 | 0.0 | Within past year (anytime less than 12 months ... | Yes | 9.0 | None of them | No | No | No | No | Yes | No | Yes | No | Yes | No | No | No | No | Yes | No | No | Never smoked | Never used e-cigarettes in my entire life | No | White only, Non-Hispanic | Age 80 or older | 1.70 | 90.72 | 31.32 | No | No | Yes | Yes | No, did not receive any tetanus shot in the pa... | No | Yes |
| 4 | Alabama | Female | Good | 3.0 | 15.0 | Within past year (anytime less than 12 months ... | Yes | 5.0 | 1 to 5 | No | No | No | No | No | No | No | No | Yes | No | No | No | No | No | No | No | Never smoked | Never used e-cigarettes in my entire life | No | White only, Non-Hispanic | Age 80 or older | 1.55 | 79.38 | 33.07 | No | No | Yes | Yes | No, did not receive any tetanus shot in the pa... | No | No |
%matplotlib inline
sns.set_style("darkgrid")
colors6 = sns.color_palette(['#1337f5', '#E80000', '#0f1e41', '#fd523e', '#404e5c', '#c9bbaa'], 6)
colors2 = sns.color_palette(['#1337f5', '#E80000'], 2)
colors1 = sns.color_palette(['#1337f5'], 1)
numeric_vars = ['PhysicalHealthDays', 'MentalHealthDays', 'SleepHours', 'HeightInMeters', 'WeightInKilograms',
'BMI']
all_cols = df.columns.tolist()
subtract_set = set(numeric_vars + ['HadHeartAttack'])
categoric_vars = [col for col in all_cols if col not in subtract_set]
categoric_vars
['State', 'Sex', 'GeneralHealth', 'LastCheckupTime', 'PhysicalActivities', 'RemovedTeeth', 'HadAngina', 'HadStroke', 'HadAsthma', 'HadSkinCancer', 'HadCOPD', 'HadDepressiveDisorder', 'HadKidneyDisease', 'HadArthritis', 'HadDiabetes', 'DeafOrHardOfHearing', 'BlindOrVisionDifficulty', 'DifficultyConcentrating', 'DifficultyWalking', 'DifficultyDressingBathing', 'DifficultyErrands', 'SmokerStatus', 'ECigaretteUsage', 'ChestScan', 'RaceEthnicityCategory', 'AgeCategory', 'AlcoholDrinkers', 'HIVTesting', 'FluVaxLast12', 'PneumoVaxEver', 'TetanusLast10Tdap', 'HighRiskLastYear', 'CovidPos']
for i in categoric_vars:
print(df[i].value_counts())
print()
State  Washington 15000  Maryland 9165  Minnesota 9161  Ohio 8995  New York 8923  Texas 7408  Florida 7315  Kansas 6145  Wisconsin 6126  Maine 6013  Iowa 5672  Hawaii 5596  Virginia 5565  Indiana 5502  South Carolina 5471  Massachusetts 5465  Arizona 5462  Utah 5373  Michigan 5370  Colorado 5159  Nebraska 5107  California 5096  Connecticut 5053  Georgia 4978  Vermont 4845  South Dakota 4405  Montana 4264  Missouri 4195  New Jersey 3967  New Hampshire 3756  Puerto Rico 3589  Idaho 3468  Alaska 3205  Rhode Island 3112  Oregon 3049  Louisiana 3010  West Virginia 2974  New Mexico 2968  Oklahoma 2941  Arkansas 2940  Pennsylvania 2729  Tennessee 2725  Illinois 2607  North Carolina 2551  North Dakota 2498  Mississippi 2438  Kentucky 2413  Wyoming 2410  Delaware 2155  Alabama 1902  Nevada 1769  District of Columbia 1725  Guam 1549  Virgin Islands 743  Name: count, dtype: int64
Sex  Female 127811  Male 118211  Name: count, dtype: int64
GeneralHealth  Very good 86999  Good 77409  Excellent 41525  Fair 30659  Poor 9430  Name: count, dtype: int64
LastCheckupTime  Within past year (anytime less than 12 months ago) 198153  Within past 2 years (1 year but less than 2 years ago) 23227  Within past 5 years (2 years but less than 5 years ago) 13744  5 or more years ago 10898  Name: count, dtype: int64
PhysicalActivities  Yes 191318  No 54704  Name: count, dtype: int64
RemovedTeeth  None of them 131592  1 to 5 74702  6 or more, but not all 25950  All 13778  Name: count, dtype: int64
HadAngina  No 231069  Yes 14953  Name: count, dtype: int64
HadStroke  No 235910  Yes 10112  Name: count, dtype: int64
HadAsthma  No 209493  Yes 36529  Name: count, dtype: int64
HadSkinCancer  No 225001  Yes 21021  Name: count, dtype: int64
HadCOPD  No 227028  Yes 18994  Name: count, dtype: int64
HadDepressiveDisorder  No 195402  Yes 50620  Name: count, dtype: int64
HadKidneyDisease  No 234738  Yes 11284  Name: count, dtype: int64
HadArthritis  No 161139  Yes 84883  Name: count, dtype: int64
HadDiabetes  No 204834  Yes 33813  No, pre-diabetes or borderline diabetes 5392  Yes, but only during pregnancy (female) 1983  Name: count, dtype: int64
DeafOrHardOfHearing  No 224990  Yes 21032  Name: count, dtype: int64
BlindOrVisionDifficulty  No 233796  Yes 12226  Name: count, dtype: int64
DifficultyConcentrating  No 219802  Yes 26220  Name: count, dtype: int64
DifficultyWalking  No 209952  Yes 36070  Name: count, dtype: int64
DifficultyDressingBathing  No 237682  Yes 8340  Name: count, dtype: int64
DifficultyErrands  No 229638  Yes 16384  Name: count, dtype: int64
SmokerStatus  Never smoked 147737  Former smoker 68527  Current smoker - now smokes every day 21659  Current smoker - now smokes some days 8099  Name: count, dtype: int64
ECigaretteUsage  Never used e-cigarettes in my entire life 190128  Not at all (right now) 43281  Use them some days 6658  Use them every day 5955  Name: count, dtype: int64
ChestScan  No 141822  Yes 104200  Name: count, dtype: int64
RaceEthnicityCategory  White only, Non-Hispanic 186336  Hispanic 22570  Black only, Non-Hispanic 19330  Other race only, Non-Hispanic 12205  Multiracial, Non-Hispanic 5581  Name: count, dtype: int64
AgeCategory  Age 65 to 69 28557  Age 60 to 64 26720  Age 70 to 74 25739  Age 55 to 59 22224  Age 50 to 54 19913  Age 75 to 79 18136  Age 80 or older 17816  Age 40 to 44 16973  Age 45 to 49 16753  Age 35 to 39 15614  Age 30 to 34 13346  Age 18 to 24 13122  Age 25 to 29 11109  Name: count, dtype: int64
AlcoholDrinkers  Yes 135307  No 110715  Name: count, dtype: int64
HIVTesting  No 161520  Yes 84502  Name: count, dtype: int64
FluVaxLast12  Yes 131196  No 114826  Name: count, dtype: int64
PneumoVaxEver  No 146130  Yes 99892  Name: count, dtype: int64
TetanusLast10Tdap  No, did not receive any tetanus shot in the past 10 years 81747  Yes, received tetanus shot but not sure what type 74119  Yes, received Tdap 70286  Yes, received tetanus shot, but not Tdap 19870  Name: count, dtype: int64
HighRiskLastYear  No 235446  Yes 10576  Name: count, dtype: int64
CovidPos  No 167306  Yes 70324  Tested positive using home test without a health professional 8392  Name: count, dtype: int64
def show_relation(col, according_to=None, type_='dis'):
    # 'dis'   -> kde distribution of `col`, optionally split by `according_to`
    # 'count' -> percentage bars per category (with `according_to`) or raw counts
    if type_ == 'dis':
        # displot creates its own figure, so no plt.figure() call is needed here
        sns.displot(data=df, x=col, hue=according_to, kind='kde',
                    palette=colors2, height=7, aspect=2)
    elif type_ == 'count':
        plt.figure(figsize=(15, 7))
        if according_to is not None:
            perc = (df.groupby(col)[according_to]
                      .value_counts(normalize=True)
                      .reset_index(name='Percentage'))
            sns.barplot(data=perc, x=col, y='Percentage', hue=according_to,
                        palette=colors6, order=df[col].value_counts().index)
        else:
            sns.countplot(data=df, x=col, palette=colors1,
                          order=df[col].value_counts().index)
    plt.title(col if according_to is None else f'{col} according to {according_to}')
def generate_colors(num):
colors = []
lst = list('ABCDEF0123456789')
for i in range(num):
colors.append('#'+''.join(np.random.choice(lst, 6)))
return colors
plt.figure(figsize=(15,7));
plt.title('HadHeartAttack Count');
sns.countplot(data=df, x='HadHeartAttack', palette=colors2, order=df['HadHeartAttack'].value_counts().index);
The target is heavily imbalanced: 'No' cases vastly outnumber 'Yes' cases.
# get the percentage of each HadHeartAttack class, then convert to a dictionary
disease_size = (df.groupby('HadHeartAttack').size()*100 / len(df)).to_dict()
# create figure
fig = plt.figure(
    FigureClass=Waffle,  # render as a waffle chart
    rows=5,  # rows of icons
    figsize = (9,3),
    values=disease_size,  # data
    # legend labels with class percentages
    labels=[f"{k} ({round(v / sum(disease_size.values()) * 100, 2)}%)"
            for k, v in disease_size.items()],
    # colors for the 'No' and 'Yes' classes
    colors=(colors2[0], colors2[1]),
    # heart icons for both classes
    icons = ['heart','heart'],
    # center the legend below the chart
    legend={'loc': 'lower center',
            'bbox_to_anchor': (0.5, -0.5),
            'ncol': len(disease_size),
            'framealpha': 0,
            'fontsize': 20
            },
    # size of each heart icon
    icon_size=20,
    # show the icons in the legend as well
    icon_legend=True,
    # title of the waffle chart
    title={
        'label': 'Heart Attack Per 100 People',
        'loc': 'center',
        'fontdict': {'fontsize': 20}
    }
)
disease_size
{'No': 94.53910625878986, 'Yes': 5.4608937412101355}
df.hist(figsize=(16, 12), bins=50, color=colors1);
plt.suptitle("Distribution of Numerical Values");
obj_cols = df.select_dtypes(include='object').columns[1:]  # skip the first object column, 'State'
num_cols = df.select_dtypes(exclude='object').columns
print(f'Object columns : {obj_cols}', end='\n\n')
print(f'Numerical columns : {num_cols}')
Object columns : Index(['Sex', 'GeneralHealth', 'LastCheckupTime', 'PhysicalActivities',
'RemovedTeeth', 'HadHeartAttack', 'HadAngina', 'HadStroke', 'HadAsthma',
'HadSkinCancer', 'HadCOPD', 'HadDepressiveDisorder', 'HadKidneyDisease',
'HadArthritis', 'HadDiabetes', 'DeafOrHardOfHearing',
'BlindOrVisionDifficulty', 'DifficultyConcentrating',
'DifficultyWalking', 'DifficultyDressingBathing', 'DifficultyErrands',
'SmokerStatus', 'ECigaretteUsage', 'ChestScan', 'RaceEthnicityCategory',
'AgeCategory', 'AlcoholDrinkers', 'HIVTesting', 'FluVaxLast12',
'PneumoVaxEver', 'TetanusLast10Tdap', 'HighRiskLastYear', 'CovidPos'],
dtype='object')
Numerical columns : Index(['PhysicalHealthDays', 'MentalHealthDays', 'SleepHours',
'HeightInMeters', 'WeightInKilograms', 'BMI'],
dtype='object')
plt.figure(figsize=(20, 60))
for i in range(len(obj_cols)):
    plt.subplot(17, 2, i+1)
    if df[obj_cols[i]].nunique() < 3:
        ax = sns.countplot(data=df, x=obj_cols[i], palette=colors2, order=df[obj_cols[i]].value_counts().index[:6])
    else:
        ax = sns.countplot(data=df, x=obj_cols[i], palette=colors6, order=df[obj_cols[i]].value_counts().index[:6])
    plt.title(f'{obj_cols[i]}', fontsize=15, fontweight='bold', color='brown')
    plt.subplots_adjust(hspace=0.5)
    # annotate inside the loop so every subplot gets percentage labels, not just the last one
    for p in ax.patches:
        height = p.get_height()
        width = p.get_width()
        percent = height/len(df)
        ax.text(x=p.get_x()+width/2, y=height+2, s=format(percent, ".2%"), fontsize=12, ha='center', weight='bold')
Most categorical features are also heavily imbalanced.
ax.patches
<Axes.ArtistList of 3 patches>
round(df[(df['HadHeartAttack'] == 'Yes')]['Sex'].value_counts()/df.shape[0]*100, 2)
Sex  Male 3.46  Female 2.00  Name: count, dtype: float64
df.shape
(246022, 40)
Most people in our data are white and do not have diabetes.
In the pie charts below, the left column shows each feature's distribution over all respondents, ignoring HadHeartAttack; the right column shows the distribution among respondents with HadHeartAttack = 'Yes'.
obj_cols
Index(['Sex', 'GeneralHealth', 'LastCheckupTime', 'PhysicalActivities',
'RemovedTeeth', 'HadHeartAttack', 'HadAngina', 'HadStroke', 'HadAsthma',
'HadSkinCancer', 'HadCOPD', 'HadDepressiveDisorder', 'HadKidneyDisease',
'HadArthritis', 'HadDiabetes', 'DeafOrHardOfHearing',
'BlindOrVisionDifficulty', 'DifficultyConcentrating',
'DifficultyWalking', 'DifficultyDressingBathing', 'DifficultyErrands',
'SmokerStatus', 'ECigaretteUsage', 'ChestScan', 'RaceEthnicityCategory',
'AgeCategory', 'AlcoholDrinkers', 'HIVTesting', 'FluVaxLast12',
'PneumoVaxEver', 'TetanusLast10Tdap', 'HighRiskLastYear', 'CovidPos'],
dtype='object')
for col in obj_cols:
    fig, ax = plt.subplots(1,2, figsize=(20,20))
    round(df[col].value_counts()/df.shape[0]*100, 2).plot.pie(autopct="%1.2f%%", ax=ax[0], textprops={"color":"white"}, colors=colors6, radius = 0.9)
    round(df[(df['HadHeartAttack'] == 'Yes')][col].value_counts()/df.shape[0]*100, 2).plot.pie(autopct="%1.2f%%", ax=ax[1], textprops={"color":"white"},colors=colors6, radius = 0.9)
    plt.legend(loc="upper right", bbox_to_anchor=(1, 0, 0.5, 1))
    plt.title(f'{col}', fontsize=15, fontweight='bold', color='brown')
    # tight_layout must run before show(); calling it afterwards only spawns an empty figure
    fig.tight_layout()
    plt.show()
num_cols[5]
'BMI'
show_relation(num_cols[5], 'HadHeartAttack');
plt.figure(figsize=(16, 6), dpi=80)
sns.boxplot(data=df, x='BMI', y='HadHeartAttack', saturation=0.4,
width=0.15, boxprops={'zorder': 2},
showfliers = False, whis=0, palette=colors2);
sns.violinplot(data=df, x='BMI', y='HadHeartAttack',inner='quartile', palette=colors2);
BMI alone does not appear to separate heart-attack cases well; the two distributions largely overlap.
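One way to back this reading up is to compare BMI summary statistics across the two groups rather than eyeballing the densities. A minimal sketch on synthetic stand-in data (on the real data, run the same `groupby` on `df`):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with a BMI column and an imbalanced binary target;
# values are random, only the structure mirrors the survey.
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    'BMI': rng.normal(28.5, 6, 1000),
    'HadHeartAttack': rng.choice(['No', 'Yes'], size=1000, p=[0.9, 0.1]),
})
stats = toy.groupby('HadHeartAttack')['BMI'].agg(['mean', 'median', 'std'])
print(stats.round(2))
```

If the mean/median gap between the groups is small relative to the within-group standard deviation, BMI by itself separates the classes poorly, which matches the overlapping violins above.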
show_relation(num_cols[1], 'HadHeartAttack')
obj_cols[25]
'AgeCategory'
df.groupby('AgeCategory')['HadHeartAttack'].value_counts(normalize=True).reset_index(name='Percentage')
| | AgeCategory | HadHeartAttack | Percentage |
|---|---|---|---|
| 0 | Age 18 to 24 | No | 0.996190 |
| 1 | Age 18 to 24 | Yes | 0.003810 |
| 2 | Age 25 to 29 | No | 0.995769 |
| 3 | Age 25 to 29 | Yes | 0.004231 |
| 4 | Age 30 to 34 | No | 0.993256 |
| 5 | Age 30 to 34 | Yes | 0.006744 |
| 6 | Age 35 to 39 | No | 0.990009 |
| 7 | Age 35 to 39 | Yes | 0.009991 |
| 8 | Age 40 to 44 | No | 0.986567 |
| 9 | Age 40 to 44 | Yes | 0.013433 |
| 10 | Age 45 to 49 | No | 0.974930 |
| 11 | Age 45 to 49 | Yes | 0.025070 |
| 12 | Age 50 to 54 | No | 0.964696 |
| 13 | Age 50 to 54 | Yes | 0.035304 |
| 14 | Age 55 to 59 | No | 0.949964 |
| 15 | Age 55 to 59 | Yes | 0.050036 |
| 16 | Age 60 to 64 | No | 0.941055 |
| 17 | Age 60 to 64 | Yes | 0.058945 |
| 18 | Age 65 to 69 | No | 0.924537 |
| 19 | Age 65 to 69 | Yes | 0.075463 |
| 20 | Age 70 to 74 | No | 0.906445 |
| 21 | Age 70 to 74 | Yes | 0.093555 |
| 22 | Age 75 to 79 | No | 0.886138 |
| 23 | Age 75 to 79 | Yes | 0.113862 |
| 24 | Age 80 or older | No | 0.863830 |
| 25 | Age 80 or older | Yes | 0.136170 |
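The long-format table above is easier to scan as one row per age group. Pivoting does that; as a sketch, here it is on just the two end age groups, using the percentages from the table itself:

```python
import pandas as pd

# Long-format (AgeCategory, HadHeartAttack, Percentage) rows, as produced by
# value_counts(normalize=True) above; values taken from the table's end rows.
long = pd.DataFrame({
    'AgeCategory': ['Age 18 to 24', 'Age 18 to 24',
                    'Age 80 or older', 'Age 80 or older'],
    'HadHeartAttack': ['No', 'Yes', 'No', 'Yes'],
    'Percentage': [0.996190, 0.003810, 0.863830, 0.136170],
})
# One row per age group, one column per target level; rows sum to 1.
wide = long.pivot(index='AgeCategory', columns='HadHeartAttack', values='Percentage')
print(wide.round(4))
```

On the full table, the 'Yes' column rises monotonically from 0.38% at 18-24 to 13.6% at 80 or older.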
show_relation(obj_cols[25], 'HadHeartAttack', type_='count')
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(20, 6))
sns.histplot(data=df.loc[df.HadHeartAttack == 'No'].sort_values("AgeCategory"), x='AgeCategory', ax=ax1);
ax1.set_title("Age Distribution of People Without Heart Attack")
sns.histplot(data=df.loc[df.HadHeartAttack == 'Yes'].sort_values("AgeCategory"), x='AgeCategory',
color=colors2[1], ax=ax2);
ax2.set_title("Age Distribution of Heart Attack Patients")
fig.tight_layout()
show_relation(obj_cols[0], 'HadHeartAttack', type_='count')
show_relation(obj_cols[21], 'HadHeartAttack', type_='count')
We can observe that people who smoke appear more susceptible to heart attacks.
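The bar plot's percentages boil down to a per-category heart-attack rate. A self-contained sketch of that computation on synthetic stand-in data (on the real data, replace `toy` with `df`):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in using the survey's actual SmokerStatus categories;
# target values are random, so the rates here carry no medical meaning.
rng = np.random.default_rng(2)
statuses = ['Never smoked', 'Former smoker',
            'Current smoker - now smokes every day',
            'Current smoker - now smokes some days']
toy = pd.DataFrame({
    'SmokerStatus': rng.choice(statuses, size=1500),
    'HadHeartAttack': rng.choice(['No', 'Yes'], size=1500, p=[0.94, 0.06]),
})
# Fraction of 'Yes' within each smoking category, highest first.
rates = (toy.groupby('SmokerStatus')['HadHeartAttack']
            .apply(lambda s: (s == 'Yes').mean())
            .sort_values(ascending=False))
print(rates.round(3))
```

On the real data this sorted Series makes the smoking gradient explicit without reading it off the bars.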
show_relation(obj_cols[22], 'HadHeartAttack', type_='count')
show_relation(obj_cols[23], 'HadHeartAttack', type_='count')
show_relation(obj_cols[24], 'HadHeartAttack', type_='count')
show_relation(obj_cols[26], 'HadHeartAttack', type_='count')
show_relation(obj_cols[3], 'HadHeartAttack', type_='count')
show_relation(obj_cols[4], 'HadHeartAttack', type_='count')
The number of removed teeth can serve as a proxy for a patient's age.
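The proxy claim can be checked with a row-normalized crosstab of AgeCategory against RemovedTeeth. A sketch on synthetic stand-in data (uniform random here, so it will not show the effect; on the real `df`, older rows should shift toward more removed teeth):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in using three of the survey's age bands and its four
# RemovedTeeth categories; the assignments are uniform random.
rng = np.random.default_rng(3)
ages = ['Age 18 to 24', 'Age 50 to 54', 'Age 80 or older']
teeth = ['None of them', '1 to 5', '6 or more, but not all', 'All']
toy = pd.DataFrame({'AgeCategory': rng.choice(ages, 1200),
                    'RemovedTeeth': rng.choice(teeth, 1200)})
# normalize='index' -> each age row shows its distribution over teeth categories
ct = pd.crosstab(toy['AgeCategory'], toy['RemovedTeeth'], normalize='index')
print(ct.round(2))
```

Each row sums to 1, so differing row profiles across age bands would directly quantify how strongly RemovedTeeth tracks age.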
show_relation(obj_cols[6], 'HadHeartAttack', type_='count')
show_relation(obj_cols[7], 'HadHeartAttack', type_='count')
show_relation(obj_cols[8], 'HadHeartAttack', type_='count')
show_relation(obj_cols[9], 'HadHeartAttack', type_='count')
show_relation(obj_cols[10], 'HadHeartAttack', type_='count')
show_relation(obj_cols[11], 'HadHeartAttack', type_='count')
show_relation(obj_cols[12], 'HadHeartAttack', type_='count')
show_relation(obj_cols[13], 'HadHeartAttack', type_='count')
show_relation(obj_cols[14], 'HadHeartAttack', type_='count')
show_relation(obj_cols[15], 'HadHeartAttack', type_='count')
fig, ax = plt.subplots(figsize=(14, 6))
# `shade` is deprecated in seaborn; use `fill` instead
sns.kdeplot(df[df["HadHeartAttack"]=='Yes']["BMI"], alpha=1, fill=False, color=colors6[0], label="HadHeartAttack", ax=ax)
sns.kdeplot(df[df["HadKidneyDisease"]=='Yes']["BMI"], alpha=1, fill=False, color=colors6[1], label="HadKidneyDisease", ax=ax)
sns.kdeplot(df[df["HadSkinCancer"]=='Yes']["BMI"], alpha=1, fill=False, color=colors6[2], label="HadSkinCancer", ax=ax)
sns.kdeplot(df[df["HadAsthma"]=='Yes']["BMI"], alpha=1, fill=False, color=colors6[3], label="HadAsthma", ax=ax)
sns.kdeplot(df[df["HadStroke"]=='Yes']["BMI"], alpha=1, fill=False, color=colors6[4], label="HadStroke", ax=ax)
sns.kdeplot(df[df["HadDiabetes"]=='Yes']["BMI"], alpha=1, fill=False, color=colors6[5], label="HadDiabetes", ax=ax)
ax.set_xlabel("BMI")
ax.set_ylabel("Density")
ax.legend(bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)
plt.show()
fig, ax = plt.subplots(figsize=(14, 6))
# `shade` is deprecated in seaborn; use `fill` instead
sns.kdeplot(df[df["HadHeartAttack"]=='Yes']["MentalHealthDays"], alpha=1, fill=False, color=colors6[0], label="HadHeartAttack", ax=ax)
sns.kdeplot(df[df["HadKidneyDisease"]=='Yes']["MentalHealthDays"], alpha=1, fill=False, color=colors6[1], label="HadKidneyDisease", ax=ax)
sns.kdeplot(df[df["HadSkinCancer"]=='Yes']["MentalHealthDays"], alpha=1, fill=False, color=colors6[2], label="HadSkinCancer", ax=ax)
sns.kdeplot(df[df["HadAsthma"]=='Yes']["MentalHealthDays"], alpha=1, fill=False, color=colors6[3], label="HadAsthma", ax=ax)
sns.kdeplot(df[df["HadStroke"]=='Yes']["MentalHealthDays"], alpha=1, fill=False, color=colors6[4], label="HadStroke", ax=ax)
sns.kdeplot(df[df["HadDiabetes"]=='Yes']["MentalHealthDays"], alpha=1, fill=False, color=colors6[5], label="HadDiabetes", ax=ax)
ax.set_xlabel("MentalHealthDays")
ax.set_ylabel("Density")
ax.legend(bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)
plt.show()
fig, ax = plt.subplots(figsize=(14, 6))
sns.kdeplot(df[df["HadHeartAttack"]=='Yes']["SleepHours"], alpha=1, fill=False, color=colors6[0], label="HadHeartAttack", ax=ax)
sns.kdeplot(df[df["HadKidneyDisease"]=='Yes']["SleepHours"], alpha=1, fill=False, color=colors6[1], label="HadKidneyDisease", ax=ax)
sns.kdeplot(df[df["HadSkinCancer"]=='Yes']["SleepHours"], alpha=1, fill=False, color=colors6[2], label="HadSkinCancer", ax=ax)
sns.kdeplot(df[df["HadAsthma"]=='Yes']["SleepHours"], alpha=1, fill=False, color=colors6[3], label="HadAsthma", ax=ax)
sns.kdeplot(df[df["HadStroke"]=='Yes']["SleepHours"], alpha=1, fill=False, color=colors6[4], label="HadStroke", ax=ax)
sns.kdeplot(df[df["HadDiabetes"]=='Yes']["SleepHours"], alpha=1, fill=False, color=colors6[5], label="HadDiabetes", ax=ax)
ax.set_xlabel("SleepHours")
ax.set_ylabel("Density")
ax.legend(bbox_to_anchor=(1.02, 1), loc=2, borderaxespad=0.)
plt.show()
region_mapping = {
'Connecticut': 'Northeast', 'Maine': 'Northeast', 'Massachusetts': 'Northeast',
'New Hampshire': 'Northeast', 'New Jersey': 'Northeast', 'New York': 'Northeast',
'Pennsylvania': 'Northeast', 'Rhode Island': 'Northeast', 'Vermont': 'Northeast',
'Illinois': 'Midwest', 'Indiana': 'Midwest', 'Iowa': 'Midwest', 'Kansas': 'Midwest',
'Michigan': 'Midwest', 'Minnesota': 'Midwest', 'Missouri': 'Midwest', 'Nebraska': 'Midwest',
'North Dakota': 'Midwest', 'Ohio': 'Midwest', 'South Dakota': 'Midwest', 'Wisconsin': 'Midwest',
'Alabama': 'South', 'Arkansas': 'South', 'Delaware': 'South', 'Florida': 'South',
'Georgia': 'South', 'Kentucky': 'South', 'Louisiana': 'South', 'Maryland': 'South',
'Mississippi': 'South', 'North Carolina': 'South', 'Oklahoma': 'South', 'South Carolina': 'South',
'Tennessee': 'South', 'Texas': 'South', 'Virginia': 'South', 'West Virginia': 'South',
'Alaska': 'West', 'Arizona': 'West', 'California': 'West', 'Colorado': 'West',
'Hawaii': 'West', 'Idaho': 'West', 'Montana': 'West', 'Nevada': 'West',
'New Mexico': 'West', 'Oregon': 'West', 'Utah': 'West', 'Washington': 'West', 'Wyoming': 'West'
}
df['Location'] = df['State'].map(region_mapping)
# Drop the original column
df.drop('State', axis=1, inplace=True)
df['Location'].value_counts()
Location
South        65951
Midwest      65783
West         62819
Northeast    43863
Name: count, dtype: int64
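One check worth making after `.map()`: `region_mapping` covers the 50 states, but BRFSS-style surveys can also include the District of Columbia and territories, and any such value falls through the mapping as NaN. A quick guard, sketched on a toy frame (the names here are illustrative stand-ins):

```python
import pandas as pd

# Abbreviated stand-in for the full 50-state region_mapping above.
region_mapping = {"Maine": "Northeast", "Ohio": "Midwest"}
df = pd.DataFrame({"State": ["Maine", "Ohio", "Guam"]})
df["Location"] = df["State"].map(region_mapping)

# Any state absent from the dict maps to NaN; surface it before dropping 'State'.
unmapped = df.loc[df["Location"].isna(), "State"].unique()
print(unmapped)  # values that fell through the mapping
```

If `unmapped` is non-empty on the real data, either extend the dictionary or drop/recode those rows explicitly.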
df['HadHeartAttack'].value_counts()
# roughly a 17:1 No-to-Yes class imbalance
HadHeartAttack
No     232587
Yes     13435
Name: count, dtype: int64
Check the correlation matrix for the numeric columns
correlation_matrix = df[numeric_vars].corr()
print(correlation_matrix)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.show()
PhysicalHealthDays MentalHealthDays SleepHours \
PhysicalHealthDays 1.000000 0.306800 -0.056063
MentalHealthDays 0.306800 1.000000 -0.130100
SleepHours -0.056063 -0.130100 1.000000
HeightInMeters -0.049180 -0.056010 -0.011384
WeightInKilograms 0.077505 0.042441 -0.054691
BMI 0.116905 0.082182 -0.054750
HeightInMeters WeightInKilograms BMI
PhysicalHealthDays -0.049180 0.077505 0.116905
MentalHealthDays -0.056010 0.042441 0.082182
SleepHours -0.011384 -0.054691 -0.054750
HeightInMeters 1.000000 0.473768 -0.026637
WeightInKilograms 0.473768 1.000000 0.859313
BMI -0.026637 0.859313 1.000000
Since 'WeightInKilograms' is strongly correlated with 'BMI' (r ≈ 0.86), drop 'WeightInKilograms'
numeric_vars = ['PhysicalHealthDays', 'MentalHealthDays', 'SleepHours', 'HeightInMeters', 'BMI']
df.drop('WeightInKilograms', axis=1, inplace=True)
correlation_matrix = df[numeric_vars].corr()
print(correlation_matrix)
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.show()
PhysicalHealthDays MentalHealthDays SleepHours \
PhysicalHealthDays 1.000000 0.306800 -0.056063
MentalHealthDays 0.306800 1.000000 -0.130100
SleepHours -0.056063 -0.130100 1.000000
HeightInMeters -0.049180 -0.056010 -0.011384
BMI 0.116905 0.082182 -0.054750
HeightInMeters BMI
PhysicalHealthDays -0.049180 0.116905
MentalHealthDays -0.056010 0.082182
SleepHours -0.011384 -0.054750
HeightInMeters 1.000000 -0.026637
BMI -0.026637 1.000000
numeric_df = df[numeric_vars]
X = add_constant(numeric_df)
vif_data = pd.DataFrame()
vif_data['Variable'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data)
             Variable         VIF
0               const  311.539297
1  PhysicalHealthDays    1.115642
2    MentalHealthDays    1.124034
3          SleepHours    1.019833
4      HeightInMeters    1.005062
5                 BMI    1.018593
categoric_vars.remove('State')
categoric_vars.append('HadHeartAttack')
categoric_vars.append('Location')
len(categoric_vars)
34
df_encoded = pd.get_dummies(df, columns=categoric_vars, drop_first=True)
pd.set_option('display.max_columns', None)
df_encoded.head(10)
| PhysicalHealthDays | MentalHealthDays | SleepHours | HeightInMeters | BMI | Sex_Male | GeneralHealth_Fair | GeneralHealth_Good | GeneralHealth_Poor | GeneralHealth_Very good | LastCheckupTime_Within past 2 years (1 year but less than 2 years ago) | LastCheckupTime_Within past 5 years (2 years but less than 5 years ago) | LastCheckupTime_Within past year (anytime less than 12 months ago) | PhysicalActivities_Yes | RemovedTeeth_6 or more, but not all | RemovedTeeth_All | RemovedTeeth_None of them | HadAngina_Yes | HadStroke_Yes | HadAsthma_Yes | HadSkinCancer_Yes | HadCOPD_Yes | HadDepressiveDisorder_Yes | HadKidneyDisease_Yes | HadArthritis_Yes | HadDiabetes_No, pre-diabetes or borderline diabetes | HadDiabetes_Yes | HadDiabetes_Yes, but only during pregnancy (female) | DeafOrHardOfHearing_Yes | BlindOrVisionDifficulty_Yes | DifficultyConcentrating_Yes | DifficultyWalking_Yes | DifficultyDressingBathing_Yes | DifficultyErrands_Yes | SmokerStatus_Current smoker - now smokes some days | SmokerStatus_Former smoker | SmokerStatus_Never smoked | ECigaretteUsage_Not at all (right now) | ECigaretteUsage_Use them every day | ECigaretteUsage_Use them some days | ChestScan_Yes | RaceEthnicityCategory_Hispanic | RaceEthnicityCategory_Multiracial, Non-Hispanic | RaceEthnicityCategory_Other race only, Non-Hispanic | RaceEthnicityCategory_White only, Non-Hispanic | AgeCategory_Age 25 to 29 | AgeCategory_Age 30 to 34 | AgeCategory_Age 35 to 39 | AgeCategory_Age 40 to 44 | AgeCategory_Age 45 to 49 | AgeCategory_Age 50 to 54 | AgeCategory_Age 55 to 59 | AgeCategory_Age 60 to 64 | AgeCategory_Age 65 to 69 | AgeCategory_Age 70 to 74 | AgeCategory_Age 75 to 79 | AgeCategory_Age 80 or older | AlcoholDrinkers_Yes | HIVTesting_Yes | FluVaxLast12_Yes | PneumoVaxEver_Yes | TetanusLast10Tdap_Yes, received Tdap | TetanusLast10Tdap_Yes, received tetanus shot but not sure what type | TetanusLast10Tdap_Yes, received tetanus shot, but not Tdap | HighRiskLastYear_Yes | CovidPos_Tested positive using home test without a health professional | CovidPos_Yes | HadHeartAttack_Yes | Location_Northeast | Location_South | Location_West | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4.0 | 0.0 | 9.0 | 1.60 | 27.99 | False | False | False | False | True | False | False | True | True | False | False | True | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | True | False | False | False | False | False | True | True | True | False | False | False | False | False | False | False | True | False |
| 1 | 0.0 | 0.0 | 6.0 | 1.78 | 30.13 | True | False | False | False | True | False | False | True | True | False | False | True | False | False | False | False | False | False | False | True | False | True | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | True | False | False | False | False | True | True | False | True | False | False | False | False | False | False | True | False |
| 2 | 0.0 | 0.0 | 8.0 | 1.85 | 31.66 | True | False | False | False | True | False | False | True | False | True | False | False | False | False | False | False | False | False | False | True | False | False | False | False | True | False | True | False | False | False | True | False | False | False | False | True | False | False | False | True | False | False | False | False | False | False | False | False | False | False | True | False | True | False | False | True | False | False | False | False | False | True | False | False | True | False |
| 3 | 5.0 | 0.0 | 9.0 | 1.70 | 31.32 | False | True | False | False | False | False | False | True | True | False | False | True | False | False | False | True | False | True | False | True | False | False | False | False | False | False | True | False | False | False | False | True | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | True | False | False | True | True | False | False | False | False | False | True | False | False | True | False |
| 4 | 3.0 | 15.0 | 5.0 | 1.55 | 33.07 | False | False | True | False | False | False | False | True | True | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | True | False | False | True | True | False | False | False | False | False | False | False | False | True | False |
| 5 | 0.0 | 0.0 | 7.0 | 1.85 | 34.96 | True | False | True | False | False | False | False | True | True | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | False | True | False | False | False | True | False | False | False | False | False | True | False | False | False | False | False | False | True | True | True | False | False | True | False | False | False | False | False | False | True | False |
| 6 | 3.0 | 0.0 | 8.0 | 1.63 | 33.30 | False | False | True | False | False | False | False | True | True | True | False | False | False | True | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | False | False | True | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | True | True | False | False | False | False | False | False | False | False | True | False |
| 7 | 5.0 | 0.0 | 8.0 | 1.75 | 24.37 | True | True | False | False | False | False | False | True | True | False | False | False | True | False | False | True | False | False | False | True | False | True | False | False | False | False | False | False | False | False | False | True | False | False | False | True | False | False | False | True | False | False | False | False | False | False | False | False | False | False | True | False | False | True | True | True | False | False | False | False | False | True | True | False | True | False |
| 8 | 2.0 | 0.0 | 6.0 | 1.70 | 26.94 | True | False | True | False | False | False | False | False | False | False | False | True | False | False | False | False | False | False | False | True | False | False | False | True | False | False | False | False | False | False | True | False | False | False | False | True | False | False | False | True | False | False | False | True | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | True | False | False | True | False |
| 9 | 0.0 | 0.0 | 7.0 | 1.68 | 22.60 | False | False | False | False | True | False | False | True | True | False | False | True | False | False | True | True | False | False | False | True | False | False | False | False | False | False | False | False | False | False | True | False | False | False | False | True | False | False | False | True | False | False | False | False | False | False | False | False | False | False | True | False | False | False | True | True | False | False | False | False | False | False | False | False | True | False |
pd.set_option('display.max_rows', None)
df_encoded.dtypes
PhysicalHealthDays                                                         float64
MentalHealthDays                                                           float64
SleepHours                                                                 float64
HeightInMeters                                                             float64
BMI                                                                        float64
Sex_Male                                                                      bool
GeneralHealth_Fair                                                            bool
GeneralHealth_Good                                                            bool
GeneralHealth_Poor                                                            bool
GeneralHealth_Very good                                                       bool
LastCheckupTime_Within past 2 years (1 year but less than 2 years ago)        bool
LastCheckupTime_Within past 5 years (2 years but less than 5 years ago)       bool
LastCheckupTime_Within past year (anytime less than 12 months ago)            bool
PhysicalActivities_Yes                                                        bool
RemovedTeeth_6 or more, but not all                                           bool
RemovedTeeth_All                                                              bool
RemovedTeeth_None of them                                                     bool
HadAngina_Yes                                                                 bool
HadStroke_Yes                                                                 bool
HadAsthma_Yes                                                                 bool
HadSkinCancer_Yes                                                             bool
HadCOPD_Yes                                                                   bool
HadDepressiveDisorder_Yes                                                     bool
HadKidneyDisease_Yes                                                          bool
HadArthritis_Yes                                                              bool
HadDiabetes_No, pre-diabetes or borderline diabetes                           bool
HadDiabetes_Yes                                                               bool
HadDiabetes_Yes, but only during pregnancy (female)                           bool
DeafOrHardOfHearing_Yes                                                       bool
BlindOrVisionDifficulty_Yes                                                   bool
DifficultyConcentrating_Yes                                                   bool
DifficultyWalking_Yes                                                         bool
DifficultyDressingBathing_Yes                                                 bool
DifficultyErrands_Yes                                                         bool
SmokerStatus_Current smoker - now smokes some days                            bool
SmokerStatus_Former smoker                                                    bool
SmokerStatus_Never smoked                                                     bool
ECigaretteUsage_Not at all (right now)                                        bool
ECigaretteUsage_Use them every day                                            bool
ECigaretteUsage_Use them some days                                            bool
ChestScan_Yes                                                                 bool
RaceEthnicityCategory_Hispanic                                                bool
RaceEthnicityCategory_Multiracial, Non-Hispanic                               bool
RaceEthnicityCategory_Other race only, Non-Hispanic                           bool
RaceEthnicityCategory_White only, Non-Hispanic                                bool
AgeCategory_Age 25 to 29                                                      bool
AgeCategory_Age 30 to 34                                                      bool
AgeCategory_Age 35 to 39                                                      bool
AgeCategory_Age 40 to 44                                                      bool
AgeCategory_Age 45 to 49                                                      bool
AgeCategory_Age 50 to 54                                                      bool
AgeCategory_Age 55 to 59                                                      bool
AgeCategory_Age 60 to 64                                                      bool
AgeCategory_Age 65 to 69                                                      bool
AgeCategory_Age 70 to 74                                                      bool
AgeCategory_Age 75 to 79                                                      bool
AgeCategory_Age 80 or older                                                   bool
AlcoholDrinkers_Yes                                                           bool
HIVTesting_Yes                                                                bool
FluVaxLast12_Yes                                                              bool
PneumoVaxEver_Yes                                                             bool
TetanusLast10Tdap_Yes, received Tdap                                          bool
TetanusLast10Tdap_Yes, received tetanus shot but not sure what type           bool
TetanusLast10Tdap_Yes, received tetanus shot, but not Tdap                    bool
HighRiskLastYear_Yes                                                          bool
CovidPos_Tested positive using home test without a health professional        bool
CovidPos_Yes                                                                  bool
HadHeartAttack_Yes                                                            bool
Location_Northeast                                                            bool
Location_South                                                                bool
Location_West                                                                 bool
dtype: object
df_encoded.shape
(246022, 71)
df_encoded = df_encoded.astype(int)  # note: this truncates the float columns too (e.g. BMI 27.99 -> 27, HeightInMeters 1.60 -> 1)
pd.set_option('display.max_columns', None)
df_encoded.head(10)
| PhysicalHealthDays | MentalHealthDays | SleepHours | HeightInMeters | BMI | Sex_Male | GeneralHealth_Fair | GeneralHealth_Good | GeneralHealth_Poor | GeneralHealth_Very good | LastCheckupTime_Within past 2 years (1 year but less than 2 years ago) | LastCheckupTime_Within past 5 years (2 years but less than 5 years ago) | LastCheckupTime_Within past year (anytime less than 12 months ago) | PhysicalActivities_Yes | RemovedTeeth_6 or more, but not all | RemovedTeeth_All | RemovedTeeth_None of them | HadAngina_Yes | HadStroke_Yes | HadAsthma_Yes | HadSkinCancer_Yes | HadCOPD_Yes | HadDepressiveDisorder_Yes | HadKidneyDisease_Yes | HadArthritis_Yes | HadDiabetes_No, pre-diabetes or borderline diabetes | HadDiabetes_Yes | HadDiabetes_Yes, but only during pregnancy (female) | DeafOrHardOfHearing_Yes | BlindOrVisionDifficulty_Yes | DifficultyConcentrating_Yes | DifficultyWalking_Yes | DifficultyDressingBathing_Yes | DifficultyErrands_Yes | SmokerStatus_Current smoker - now smokes some days | SmokerStatus_Former smoker | SmokerStatus_Never smoked | ECigaretteUsage_Not at all (right now) | ECigaretteUsage_Use them every day | ECigaretteUsage_Use them some days | ChestScan_Yes | RaceEthnicityCategory_Hispanic | RaceEthnicityCategory_Multiracial, Non-Hispanic | RaceEthnicityCategory_Other race only, Non-Hispanic | RaceEthnicityCategory_White only, Non-Hispanic | AgeCategory_Age 25 to 29 | AgeCategory_Age 30 to 34 | AgeCategory_Age 35 to 39 | AgeCategory_Age 40 to 44 | AgeCategory_Age 45 to 49 | AgeCategory_Age 50 to 54 | AgeCategory_Age 55 to 59 | AgeCategory_Age 60 to 64 | AgeCategory_Age 65 to 69 | AgeCategory_Age 70 to 74 | AgeCategory_Age 75 to 79 | AgeCategory_Age 80 or older | AlcoholDrinkers_Yes | HIVTesting_Yes | FluVaxLast12_Yes | PneumoVaxEver_Yes | TetanusLast10Tdap_Yes, received Tdap | TetanusLast10Tdap_Yes, received tetanus shot but not sure what type | TetanusLast10Tdap_Yes, received tetanus shot, but not Tdap | HighRiskLastYear_Yes | CovidPos_Tested positive using home test without a health professional | CovidPos_Yes | HadHeartAttack_Yes | Location_Northeast | Location_South | Location_West | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 0 | 9 | 1 | 27 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 6 | 1 | 30 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 8 | 1 | 31 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 5 | 0 | 9 | 1 | 31 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 3 | 15 | 5 | 1 | 33 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 5 | 0 | 0 | 7 | 1 | 34 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 6 | 3 | 0 | 8 | 1 | 33 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 7 | 5 | 0 | 8 | 1 | 24 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
| 8 | 2 | 0 | 6 | 1 | 26 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 9 | 0 | 0 | 7 | 1 | 22 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
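The blanket `astype(int)` above is convenient but lossy: as the `head(10)` shows, it truncates the continuous columns as well (BMI 27.99 becomes 27, HeightInMeters 1.60 becomes 1). If the continuous values should be preserved, a sketch that casts only the boolean dummy columns (with a tiny stand-in frame for `df_encoded`):

```python
import pandas as pd

# Two-column stand-in for the real df_encoded (float feature + bool dummy).
df_encoded = pd.DataFrame({
    "BMI": [27.99, 30.13],
    "Sex_Male": [False, True],
})

# Select only the bool dummies and cast those; floats are left intact.
bool_cols = df_encoded.select_dtypes(include="bool").columns
df_encoded[bool_cols] = df_encoded[bool_cols].astype(int)
print(df_encoded.dtypes.tolist())  # BMI stays float64, Sex_Male becomes int
```

Whether truncation matters depends on the model; for the scaled logistic regressions below, keeping the float precision is the safer default.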
X = df_encoded.drop('HadHeartAttack_Yes', axis=1)
y = df_encoded['HadHeartAttack_Yes']
from interpret.glassbox import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
logreg_ss=LogisticRegression(solver="liblinear", penalty="l2", C=0.00001, max_iter=10000)
logreg_ss.fit(X_train_scaled,y_train)
y_pred_log=logreg_ss.predict(X_test_scaled)
y_pred_proba_log = logreg_ss.predict_proba(X_test_scaled)
fpr_log, tpr_log, _ = metrics.roc_curve(y_test, y_pred_proba_log[:,1])
auc_log = round(metrics.auc(fpr_log, tpr_log),5)
simple_log = pd.DataFrame(data=[accuracy_score(y_test, y_pred_log),
precision_score(y_test, y_pred_log, average='binary'),
recall_score(y_test, y_pred_log, average='binary'),
f1_score(y_test, y_pred_log, average='binary'),
roc_auc_score(y_test, y_pred_proba_log[:,1])],
index=['Accuracy','Precision','Recall','F1-score','AUC'],
columns = ["Logistic_regression_simple"])
simple_log
| Logistic_regression_simple | |
|---|---|
| Accuracy | 0.946550 |
| Precision | 0.531425 |
| Recall | 0.345368 |
| F1-score | 0.418656 |
| AUC | 0.889810 |
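The `fpr_log`/`tpr_log`/`auc_log` values computed above are all that is needed to draw the ROC curve alongside the metrics table. A self-contained sketch, with synthetic labels and scores standing in for `y_test` and the model's predicted probabilities:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering
import matplotlib.pyplot as plt
import numpy as np
from sklearn import metrics

# Synthetic stand-ins for y_test and y_pred_proba_log[:, 1].
rng = np.random.default_rng(0)
y_test = rng.integers(0, 2, 500)
scores = np.clip(y_test * 0.5 + rng.normal(0.25, 0.2, 500), 0, 1)

fpr_log, tpr_log, _ = metrics.roc_curve(y_test, scores)
auc_log = round(metrics.auc(fpr_log, tpr_log), 5)

plt.figure(figsize=(6, 6))
plt.plot(fpr_log, tpr_log, label=f"Logistic regression (AUC = {auc_log})")
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")  # chance line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend(loc="lower right")
```

On the real data this makes the gap between the high accuracy (0.947) and the modest recall (0.345) easy to see at a glance.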
plt.figure(figsize=(10, 5))
ax = sns.countplot(data=df_encoded, x='HadHeartAttack_Yes')
for container in ax.containers:
    ax.bar_label(container, label_type='center', rotation=0, color='white')
plt.title("Distribution before resampling", size=16)
plt.show()
pip install -U imbalanced-learn
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=0)  # `n_jobs` is deprecated since imbalanced-learn 0.10; pass a preconfigured nearest-neighbors estimator instead if parallelism is needed
X_smote, y_smote = smote.fit_resample(X, y)
plt.figure(figsize=(10, 6))
ax = sns.countplot(x=y_smote)
for container in ax.containers:
    ax.bar_label(container, label_type='center', rotation=0, color='white')
plt.title("Distribution After SMOTE", size=14)
plt.show()
X_train , X_test, y_train, y_test = train_test_split(X_smote, y_smote , test_size=0.2, random_state=0)
scaler= StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
logreg_s=LogisticRegression(solver="liblinear", penalty="l2", C=0.00001, max_iter=10000)
logreg_s.fit(X_train_scaled,y_train)
y_pred_log=logreg_s.predict(X_test_scaled)
y_pred_proba_log = logreg_s.predict_proba(X_test_scaled)
fpr_log, tpr_log, _ = metrics.roc_curve(y_test, y_pred_proba_log[:,1])
auc_log = round(metrics.auc(fpr_log, tpr_log),5)
SMOTE_log = pd.DataFrame(data=[accuracy_score(y_test, y_pred_log),
precision_score(y_test, y_pred_log, average='binary'),
recall_score(y_test, y_pred_log, average='binary'),
f1_score(y_test, y_pred_log, average='binary'),
roc_auc_score(y_test, y_pred_proba_log[:,1])],
index=['Accuracy','Precision','Recall','F1-score','AUC'],
columns = ["Logistic_regression_smote"])
SMOTE_log
| Logistic_regression_smote | |
|---|---|
| Accuracy | 0.885312 |
| Precision | 0.874661 |
| Recall | 0.898344 |
| F1-score | 0.886344 |
| AUC | 0.953757 |
X_smote_const = sm.add_constant(X_smote)

def backward_elimination(X, y, significance_level=0.05):
    num_vars = X.shape[1]
    for i in range(num_vars):
        model = sm.Logit(y, X).fit(disp=0)
        max_p_value = max(model.pvalues.iloc[1:])  # exclude the constant term
        feature_with_max_p_value = model.pvalues.iloc[1:].idxmax()
        if max_p_value > significance_level:
            X = X.drop([feature_with_max_p_value], axis=1)
        else:
            break
    return model, X.columns.tolist()
final_logit_model, final_features = backward_elimination(X_smote_const, y_smote)
print(final_logit_model.summary())
print("Final features selected:", final_features)
Logit Regression Results
==============================================================================
Dep. Variable: HadHeartAttack_Yes No. Observations: 465174
Model: Logit Df Residuals: 465103
Method: MLE Df Model: 70
Date: Tue, 05 Mar 2024 Pseudo R-squ.: 0.6838
Time: 17:42:13 Log-Likelihood: -1.0196e+05
converged: True LL-Null: -3.2243e+05
Covariance Type: nonrobust LLR p-value: 0.000
===========================================================================================================================================
coef std err z P>|z| [0.025 0.975]
-------------------------------------------------------------------------------------------------------------------------------------------
const 9.6304 0.222 43.434 0.000 9.196 10.065
PhysicalHealthDays 0.0138 0.001 16.760 0.000 0.012 0.015
MentalHealthDays -0.0048 0.001 -5.019 0.000 -0.007 -0.003
SleepHours -0.1309 0.004 -32.147 0.000 -0.139 -0.123
HeightInMeters -1.1759 0.214 -5.503 0.000 -1.595 -0.757
BMI 0.0186 0.001 17.665 0.000 0.017 0.021
Sex_Male 0.3892 0.012 31.881 0.000 0.365 0.413
GeneralHealth_Fair -0.9913 0.022 -45.644 0.000 -1.034 -0.949
GeneralHealth_Good -1.0248 0.016 -64.260 0.000 -1.056 -0.994
GeneralHealth_Poor -0.4666 0.035 -13.271 0.000 -0.536 -0.398
GeneralHealth_Very good -1.2513 0.016 -76.257 0.000 -1.283 -1.219
LastCheckupTime_Within past 2 years (1 year but less than 2 years ago) -2.2381 0.043 -51.822 0.000 -2.323 -2.153
LastCheckupTime_Within past 5 years (2 years but less than 5 years ago) -2.6071 0.060 -43.246 0.000 -2.725 -2.489
LastCheckupTime_Within past year (anytime less than 12 months ago) -0.4060 0.025 -16.032 0.000 -0.456 -0.356
PhysicalActivities_Yes -0.4414 0.013 -32.817 0.000 -0.468 -0.415
RemovedTeeth_6 or more, but not all -1.0589 0.019 -56.007 0.000 -1.096 -1.022
RemovedTeeth_All -0.9856 0.023 -42.520 0.000 -1.031 -0.940
RemovedTeeth_None of them -1.2603 0.014 -92.827 0.000 -1.287 -1.234
HadAngina_Yes 2.5544 0.019 137.029 0.000 2.518 2.591
HadStroke_Yes 0.4309 0.026 16.372 0.000 0.379 0.483
HadAsthma_Yes -0.6426 0.022 -29.758 0.000 -0.685 -0.600
HadSkinCancer_Yes -0.6174 0.021 -29.043 0.000 -0.659 -0.576
HadCOPD_Yes -0.3713 0.023 -16.373 0.000 -0.416 -0.327
HadDepressiveDisorder_Yes -0.4059 0.021 -19.621 0.000 -0.446 -0.365
HadKidneyDisease_Yes -0.4088 0.028 -14.713 0.000 -0.463 -0.354
HadArthritis_Yes -0.1141 0.013 -8.924 0.000 -0.139 -0.089
HadDiabetes_No, pre-diabetes or borderline diabetes -1.3309 0.061 -21.840 0.000 -1.450 -1.212
HadDiabetes_Yes -0.1030 0.016 -6.328 0.000 -0.135 -0.071
HadDiabetes_Yes, but only during pregnancy (female) -1.0337 0.148 -6.963 0.000 -1.325 -0.743
DeafOrHardOfHearing_Yes -0.3241 0.020 -16.033 0.000 -0.364 -0.285
BlindOrVisionDifficulty_Yes -0.3345 0.031 -10.825 0.000 -0.395 -0.274
DifficultyConcentrating_Yes -0.3266 0.026 -12.693 0.000 -0.377 -0.276
DifficultyWalking_Yes -0.2507 0.019 -13.217 0.000 -0.288 -0.213
DifficultyDressingBathing_Yes -0.2140 0.038 -5.573 0.000 -0.289 -0.139
DifficultyErrands_Yes -0.2384 0.029 -8.199 0.000 -0.295 -0.181
SmokerStatus_Current smoker - now smokes some days -2.0039 0.049 -40.880 0.000 -2.100 -1.908
SmokerStatus_Former smoker -1.4353 0.019 -77.114 0.000 -1.472 -1.399
SmokerStatus_Never smoked -2.1291 0.018 -116.406 0.000 -2.165 -2.093
ECigaretteUsage_Not at all (right now) -1.1767 0.020 -59.381 0.000 -1.215 -1.138
ECigaretteUsage_Use them every day -2.3336 0.083 -28.109 0.000 -2.496 -2.171
ECigaretteUsage_Use them some days -2.0157 0.069 -29.004 0.000 -2.152 -1.879
ChestScan_Yes 0.6380 0.012 52.482 0.000 0.614 0.662
RaceEthnicityCategory_Hispanic -1.8906 0.034 -54.990 0.000 -1.958 -1.823
RaceEthnicityCategory_Multiracial, Non-Hispanic -1.7094 0.063 -26.933 0.000 -1.834 -1.585
RaceEthnicityCategory_Other race only, Non-Hispanic -1.9957 0.044 -45.742 0.000 -2.081 -1.910
RaceEthnicityCategory_White only, Non-Hispanic -0.6549 0.018 -35.957 0.000 -0.691 -0.619
AgeCategory_Age 25 to 29 -5.7152 0.131 -43.517 0.000 -5.973 -5.458
AgeCategory_Age 30 to 34 -5.7736 0.103 -55.876 0.000 -5.976 -5.571
AgeCategory_Age 35 to 39 -5.2732 0.069 -76.302 0.000 -5.409 -5.138
AgeCategory_Age 40 to 44 -5.0994 0.058 -88.679 0.000 -5.212 -4.987
AgeCategory_Age 45 to 49 -4.4927 0.044 -102.160 0.000 -4.579 -4.407
AgeCategory_Age 50 to 54 -4.0281 0.034 -119.062 0.000 -4.094 -3.962
AgeCategory_Age 55 to 59 -3.6575 0.028 -130.392 0.000 -3.713 -3.603
AgeCategory_Age 60 to 64 -3.3886 0.024 -139.027 0.000 -3.436 -3.341
AgeCategory_Age 65 to 69 -3.0787 0.022 -137.410 0.000 -3.123 -3.035
AgeCategory_Age 70 to 74 -2.7478 0.022 -124.357 0.000 -2.791 -2.704
AgeCategory_Age 75 to 79 -2.6979 0.024 -112.564 0.000 -2.745 -2.651
AgeCategory_Age 80 or older -2.3676 0.024 -99.454 0.000 -2.414 -2.321
AlcoholDrinkers_Yes -0.6844 0.012 -55.688 0.000 -0.708 -0.660
HIVTesting_Yes -0.4166 0.016 -26.190 0.000 -0.448 -0.385
FluVaxLast12_Yes -0.1687 0.013 -13.185 0.000 -0.194 -0.144
PneumoVaxEver_Yes 0.2566 0.014 18.699 0.000 0.230 0.284
TetanusLast10Tdap_Yes, received Tdap -1.1001 0.017 -65.768 0.000 -1.133 -1.067
TetanusLast10Tdap_Yes, received tetanus shot but not sure what type -0.8405 0.014 -60.862 0.000 -0.868 -0.813
TetanusLast10Tdap_Yes, received tetanus shot, but not Tdap -1.4708 0.027 -54.175 0.000 -1.524 -1.418
HighRiskLastYear_Yes -1.2989 0.060 -21.826 0.000 -1.416 -1.182
CovidPos_Tested positive using home test without a health professional -1.8857 0.068 -27.755 0.000 -2.019 -1.753
CovidPos_Yes -0.7648 0.016 -48.600 0.000 -0.796 -0.734
Location_Northeast -1.2580 0.018 -69.188 0.000 -1.294 -1.222
Location_South -1.2565 0.015 -84.610 0.000 -1.286 -1.227
Location_West -1.0043 0.016 -62.753 0.000 -1.036 -0.973
===========================================================================================================================================
Final features selected: ['const', 'PhysicalHealthDays', 'MentalHealthDays', 'SleepHours', 'HeightInMeters', 'BMI', 'Sex_Male', 'GeneralHealth_Fair', 'GeneralHealth_Good', 'GeneralHealth_Poor', 'GeneralHealth_Very good', 'LastCheckupTime_Within past 2 years (1 year but less than 2 years ago)', 'LastCheckupTime_Within past 5 years (2 years but less than 5 years ago)', 'LastCheckupTime_Within past year (anytime less than 12 months ago)', 'PhysicalActivities_Yes', 'RemovedTeeth_6 or more, but not all', 'RemovedTeeth_All', 'RemovedTeeth_None of them', 'HadAngina_Yes', 'HadStroke_Yes', 'HadAsthma_Yes', 'HadSkinCancer_Yes', 'HadCOPD_Yes', 'HadDepressiveDisorder_Yes', 'HadKidneyDisease_Yes', 'HadArthritis_Yes', 'HadDiabetes_No, pre-diabetes or borderline diabetes', 'HadDiabetes_Yes', 'HadDiabetes_Yes, but only during pregnancy (female)', 'DeafOrHardOfHearing_Yes', 'BlindOrVisionDifficulty_Yes', 'DifficultyConcentrating_Yes', 'DifficultyWalking_Yes', 'DifficultyDressingBathing_Yes', 'DifficultyErrands_Yes', 'SmokerStatus_Current smoker - now smokes some days', 'SmokerStatus_Former smoker', 'SmokerStatus_Never smoked', 'ECigaretteUsage_Not at all (right now)', 'ECigaretteUsage_Use them every day', 'ECigaretteUsage_Use them some days', 'ChestScan_Yes', 'RaceEthnicityCategory_Hispanic', 'RaceEthnicityCategory_Multiracial, Non-Hispanic', 'RaceEthnicityCategory_Other race only, Non-Hispanic', 'RaceEthnicityCategory_White only, Non-Hispanic', 'AgeCategory_Age 25 to 29', 'AgeCategory_Age 30 to 34', 'AgeCategory_Age 35 to 39', 'AgeCategory_Age 40 to 44', 'AgeCategory_Age 45 to 49', 'AgeCategory_Age 50 to 54', 'AgeCategory_Age 55 to 59', 'AgeCategory_Age 60 to 64', 'AgeCategory_Age 65 to 69', 'AgeCategory_Age 70 to 74', 'AgeCategory_Age 75 to 79', 'AgeCategory_Age 80 or older', 'AlcoholDrinkers_Yes', 'HIVTesting_Yes', 'FluVaxLast12_Yes', 'PneumoVaxEver_Yes', 'TetanusLast10Tdap_Yes, received Tdap', 'TetanusLast10Tdap_Yes, received tetanus shot but not sure what type', 
'TetanusLast10Tdap_Yes, received tetanus shot, but not Tdap', 'HighRiskLastYear_Yes', 'CovidPos_Tested positive using home test without a health professional', 'CovidPos_Yes', 'Location_Northeast', 'Location_South', 'Location_West']
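Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are easier to read off. A small sketch using the HadAngina_Yes coefficient from the summary above (the value is copied by hand here; in practice apply np.exp to the fitted model's full params Series):

```python
import numpy as np

# Logit coefficients are log-odds; exp() converts them to odds ratios.
# Value copied from the summary above (HadAngina_Yes coef = 2.5544).
coef_had_angina = 2.5544
odds_ratio = np.exp(coef_had_angina)
print(round(odds_ratio, 2))  # ~12.86: reporting angina multiplies the
                             # estimated odds of a heart attack by ~13
```

The same transform over the whole coefficient vector (np.exp(final_logit_model.params)) turns a coefficient plot into an odds-ratio plot.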
final_features = ['PhysicalHealthDays', 'MentalHealthDays', 'SleepHours', 'HeightInMeters', 'BMI', 'Sex_Male', 'GeneralHealth_Fair', 'GeneralHealth_Good', 'GeneralHealth_Poor', 'GeneralHealth_Very good', 'LastCheckupTime_Within past 2 years (1 year but less than 2 years ago)', 'LastCheckupTime_Within past 5 years (2 years but less than 5 years ago)', 'LastCheckupTime_Within past year (anytime less than 12 months ago)', 'PhysicalActivities_Yes', 'RemovedTeeth_6 or more, but not all', 'RemovedTeeth_All', 'RemovedTeeth_None of them', 'HadAngina_Yes', 'HadStroke_Yes', 'HadAsthma_Yes', 'HadSkinCancer_Yes', 'HadCOPD_Yes', 'HadDepressiveDisorder_Yes', 'HadKidneyDisease_Yes', 'HadArthritis_Yes', 'HadDiabetes_No, pre-diabetes or borderline diabetes', 'HadDiabetes_Yes', 'HadDiabetes_Yes, but only during pregnancy (female)', 'DeafOrHardOfHearing_Yes', 'BlindOrVisionDifficulty_Yes', 'DifficultyConcentrating_Yes', 'DifficultyWalking_Yes', 'DifficultyDressingBathing_Yes', 'DifficultyErrands_Yes', 'SmokerStatus_Current smoker - now smokes some days', 'SmokerStatus_Former smoker', 'SmokerStatus_Never smoked', 'ECigaretteUsage_Not at all (right now)', 'ECigaretteUsage_Use them every day', 'ECigaretteUsage_Use them some days', 'ChestScan_Yes', 'AlcoholDrinkers_Yes', 'HIVTesting_Yes', 'FluVaxLast12_Yes', 'PneumoVaxEver_Yes', 'TetanusLast10Tdap_Yes, received Tdap', 'TetanusLast10Tdap_Yes, received tetanus shot but not sure what type', 'TetanusLast10Tdap_Yes, received tetanus shot, but not Tdap', 'HighRiskLastYear_Yes', 'CovidPos_Tested positive using home test without a health professional', 'CovidPos_Yes', 'Location_Northeast', 'Location_South', 'Location_West']
len(final_features)
54
coefficients = final_logit_model.params
coef_df = pd.DataFrame(coefficients, columns=['Coefficient']).reset_index()
coef_df.rename(columns={'index': 'Feature'}, inplace=True)
final_features_with_const = ['const'] + final_features  # include the constant term
coef_df = coef_df[coef_df['Feature'].isin(final_features_with_const)]
coef_df = coef_df.sort_values(by='Coefficient', ascending=False)
plt.figure(figsize=(10, 8))
sns.barplot(x='Coefficient', y='Feature', data=coef_df)
plt.title('Feature Importance from Logistic Regression')
plt.xlabel('Coefficient Value')
plt.ylabel('Feature')
plt.tight_layout()  # keep labels from being clipped
plt.show()
X = df_encoded[final_features]
y = df_encoded['HadHeartAttack_Yes']
X_smote, y_smote = smote.fit_resample(X, y)
/Users/kyusungcho/anaconda3/lib/python3.11/site-packages/imblearn/over_sampling/_smote/base.py:363: FutureWarning: The parameter `n_jobs` has been deprecated in 0.10 and will be removed in 0.12. You can pass an nearest neighbors estimator where `n_jobs` is already set instead. warnings.warn(
X_train , X_test, y_train, y_test = train_test_split(X_smote, y_smote , test_size=0.2, random_state=0)
scaler= StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
logit_model = LogisticRegression(max_iter=3000, random_state=42)
logit_model.fit(X_train_scaled, y_train)
# Bug: the model was fitted on scaled data but predicts on the unscaled
# X_test here -- this triggers the feature-name warning below and deflates
# the AUC. X_test_scaled should be used for a fair evaluation.
auc = roc_auc_score(y_test, logit_model.predict_proba(X_test)[:, 1])
print("Logistic Regression AUC on Test Set: {:.3f}".format(auc))
Logistic Regression AUC on Test Set: 0.838
/Users/kyusungcho/anaconda3/lib/python3.11/site-packages/sklearn/base.py:432: UserWarning: X has feature names, but LogisticRegression was fitted without feature names warnings.warn(
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.2, random_state=0)
logit_model = LogisticRegression(max_iter=3000, random_state=42)
logit_model.fit(X_train, y_train)
y_pred = logit_model.predict(X_test)
y_pred_proba = logit_model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy: {:.3f}".format(accuracy))
print("Recall: {:.3f}".format(recall))
print("Precision: {:.3f}".format(precision))
print("F1 Score: {:.3f}".format(f1))
print("Logistic Regression AUC on Test Set: {:.3f}".format(auc))
Accuracy: 0.873 Recall: 0.869 Precision: 0.875 F1 Score: 0.872 Logistic Regression AUC on Test Set: 0.944
X_train, X_test, y_train, y_test = train_test_split(X_smote, y_smote, test_size=0.2, random_state=42)
logit_model = LogisticRegression(random_state=42, max_iter=3000)
logit_model.fit(X_train, y_train)
y_pred = logit_model.predict(X_test)
y_pred_prob = logit_model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_prob)
print("Logistic Regression AUC on Test Set: {:.3f}".format(auc))
from interpret import show

lr_global = logit_model.explain_global()
show(lr_global)
Logistic Regression AUC on Test Set: 0.945
Note: the interpret LR plot is interactive and does not render in the exported HTML/notebook.
Explainable Boosting Machine (EBM) is a machine learning algorithm that combines the principles of traditional gradient boosting with transparent and interpretable modeling techniques. EBM is designed to provide accurate predictions while also offering explanations or interpretations for those predictions, making it particularly useful in domains where understanding the model's reasoning is crucial, such as healthcare or finance.
Key features of EBM include:
Interpretability: EBM constructs models that are inherently interpretable, meaning that the relationships between input features and the predicted outcome are transparent and understandable. This transparency facilitates trust in the model's predictions and helps stakeholders comprehend the factors driving those predictions.
Additive modeling: Similar to traditional boosting algorithms, EBM builds an ensemble of weak learners (often decision trees) sequentially, where each subsequent learner focuses on capturing the patterns that were not adequately addressed by previous learners. However, EBM differs from other boosting methods by using additive rather than multiplicative updates, which simplifies the interpretation of the resulting model.
Monotonicity constraints: EBM allows users to impose monotonicity constraints on the relationships between input features and the predicted outcome. This means that users can specify whether they expect certain features to have a positive or negative impact on the prediction, thereby aligning the model's behavior with domain knowledge or business requirements.
Global and local explanations: EBM provides both global explanations, which describe the overall behavior of the model across the entire dataset, and local explanations, which explain individual predictions. Local explanations help users understand why a particular prediction was made for a specific instance, offering insights into the model's decision-making process.
Overall, EBM strikes a balance between predictive performance and interpretability, making it a valuable tool for applications where understanding the underlying logic of the model is as important as achieving high accuracy.
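The additive structure described above can be illustrated with a toy score function: the prediction is an intercept plus one shape-function contribution per feature, passed through a logistic link, so each feature's effect can be read off directly. This is a conceptual sketch, not the interpret library's internals; the shape functions and their values here are hypothetical stand-ins for the per-feature functions an EBM learns.

```python
import numpy as np

# Toy additive model: score(x) = intercept + f_bmi(bmi) + f_age(age_bin).
# Each "shape function" is a hypothetical lookup, standing in for the
# per-feature piecewise functions an EBM fits.
intercept = -2.0
f_bmi = lambda bmi: 0.02 * (bmi - 25.0)          # contribution from BMI
f_age = {"<50": -1.0, "50-64": 0.0, "65+": 1.2}  # contribution from age bin

def ebm_like_score(bmi, age_bin):
    return intercept + f_bmi(bmi) + f_age[age_bin]

def predict_proba(bmi, age_bin):
    # Logistic link turns the additive score into a probability
    return 1.0 / (1.0 + np.exp(-ebm_like_score(bmi, age_bin)))

# Each term is individually inspectable -- that is the interpretability claim.
print(round(f_bmi(30.0), 3), round(predict_proba(30.0, "65+"), 3))
```

Because the terms combine by addition, plotting each shape function alone (as explain_global does) fully describes the model's behavior.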
from interpret.glassbox import ExplainableBoostingClassifier

ebm = ExplainableBoostingClassifier(random_state=42)
ebm.fit(X_train, y_train)
y_pred = ebm.predict(X_test)
y_pred_proba = ebm.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)
accuracy = accuracy_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print("Accuracy: {:.3f}".format(accuracy))
print("Recall: {:.3f}".format(recall))
print("Precision: {:.3f}".format(precision))
print("F1 Score: {:.3f}".format(f1))
print("EBM AUC on Test Set: {:.3f}".format(auc))
ebm_global = ebm.explain_global()
show(ebm_global)
/Users/kyusungcho/anaconda3/lib/python3.11/site-packages/pandas/core/arrays/masked.py:60: UserWarning: Pandas requires version '1.3.6' or newer of 'bottleneck' (version '1.3.5' currently installed). from pandas.core import (
Accuracy: 0.880 Recall: 0.879 Precision: 0.881 F1 Score: 0.880 EBM AUC on Test Set: 0.951
Note: the interpret EBM plot is interactive and does not render in the exported HTML/notebook.
We want to see if there's a difference in HadHeartAttack between patients who HadStroke (Group A) and those who didn't (Group B).
Calculate Propensity Scores: We use logistic regression to calculate propensity scores for each patient, representing the likelihood of HadStroke based on their characteristics (e.g., gender, general health, HadAngina).
Match Individuals: We then match patients from Group A with similar propensity scores to patients in Group B. For example, if a patient in Group A has a propensity score of 0.7, we find a patient in Group B with a similar score. The goal is to form pairs or sets of individuals who are similar in their propensity to receive the treatment but differ in whether they actually received it.
Compare Outcomes: With the matched pairs, we can now compare the difference in HadHeartAttack between patients who HadStroke and those who didn't, within each pair.
Assess Results: We analyze the results to determine if there's a significant difference in HadHeartAttack between the two groups after accounting for the propensity score matching.
By using propensity score matching, we aim to reduce the influence of confounding variables and obtain a more accurate estimate of the effect of the HadStroke on HadHeartAttack.
By minimizing the influence of confounding variables, we have greater confidence that any observed differences in outcomes are indeed due to the treatment itself rather than other factors.
In the context of causal inference, the output causal.estimates typically refers to the estimated causal effects obtained from the causal model. The specific output may vary depending on the software or library being used, but commonly, it includes estimates of the Average Treatment Effect (ATE), Average Treatment Effect on the Treated (ATT), and Average Treatment Effect on the Control (ATC). Here's a brief explanation of each:
Average Treatment Effect (ATE): This represents the average causal effect of the treatment on the outcome across the entire population. It provides an estimate of how the outcome variable would change on average if everyone in the population were treated compared to if no one were treated.
Average Treatment Effect on the Treated (ATT): This measures the average causal effect of the treatment on the outcome among individuals who actually received the treatment. It provides insights into how the outcome variable would change for those who received the treatment compared to if they had not received it.
Average Treatment Effect on the Control (ATC): This measures the average causal effect of the treatment on the outcome among individuals who did not receive the treatment. It provides insights into how the outcome variable would change for those who did not receive the treatment compared to if they had received it.
These estimates help researchers understand the impact of the treatment variable on the outcome variable in different subpopulations and provide valuable insights for decision-making and policy formulation. The specific values of ATE, ATT, and ATC obtained from the causal.estimates output will depend on the dataset, the causal modeling approach used, and any assumptions or specifications made during the estimation process.
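On matched data these three quantities reduce to simple averages of within-pair outcome differences. A toy version with made-up matched pairs (hypothetical numbers, not the study data):

```python
import numpy as np

# Hypothetical matched sample: treatment flag D, observed outcome Y,
# and the outcome of each unit's matched counterpart from the other group.
D       = np.array([1, 1, 1, 0, 0, 0])
Y       = np.array([1, 1, 0, 0, 1, 0])
Y_match = np.array([0, 0, 0, 1, 1, 0])

# Unit-level effect: treated-minus-control outcome within each matched pair
effect = np.where(D == 1, Y - Y_match, Y_match - Y)

ate = effect.mean()            # average over everyone
att = effect[D == 1].mean()    # average over the treated only
atc = effect[D == 0].mean()    # average over the controls only
print(ate, att, atc)           # 0.5, 0.667, 0.333
```

Real estimators like causalinference's est_via_matching additionally adjust for covariate imbalance within pairs, but the quantities reported are averages of exactly this kind.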
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

logit = LogisticRegression()
X_propensity = df_encoded.drop(['HadHeartAttack_Yes', 'HadStroke_Yes'], axis=1)
y_propensity = df_encoded['HadStroke_Yes']
logit.fit(X_propensity, y_propensity)
propensity_scores = logit.predict_proba(X_propensity)[:, 1]
df_encoded['propensity_score'] = propensity_scores
treated = df_encoded[df_encoded['HadStroke_Yes'] == 1]
control = df_encoded[df_encoded['HadStroke_Yes'] == 0]
# 1-nearest-neighbor matching on the propensity score (with replacement)
nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control[['propensity_score']])
distances, indices = nn.kneighbors(treated[['propensity_score']])
matched_control = control.iloc[indices.flatten()]
matched_data = pd.concat([treated, matched_control])
treated_effect = matched_data[matched_data['HadStroke_Yes'] == 1]['HadHeartAttack_Yes'].mean()
control_effect = matched_data[matched_data['HadStroke_Yes'] == 0]['HadHeartAttack_Yes'].mean()
print(f"Heart attack incidence rate in the treated group: {treated_effect:.2f}")
print(f"Heart attack incidence rate in the control group: {control_effect:.2f}")
Heart attack incidence rate in the treated group: 0.25 Heart attack incidence rate in the control group: 0.15
/Users/kyusungcho/anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
from causalinference import CausalModel

Y = df_encoded['HadHeartAttack_Yes'].values
D = df_encoded['HadStroke_Yes'].values
X = df_encoded.drop(['HadHeartAttack_Yes', 'HadStroke_Yes'], axis=1).values
causal = CausalModel(Y, D, X)
causal.est_propensity()
causal.est_via_matching()
print(causal.estimates)
Treatment Effect Estimates: Matching
Est. S.e. z P>|z| [95% Conf. int.]
--------------------------------------------------------------------------------
ATE 0.060 0.013 4.549 0.000 0.034 0.085
ATC 0.057 0.014 4.219 0.000 0.031 0.084
ATT 0.113 0.006 18.382 0.000 0.101 0.125
# treated.head(5)
# matched_control.head(5)
# matched_data.head(5)
from causalinference import CausalModel
logit = LogisticRegression()
# Note: df_encoded still contains the 'propensity_score' column added in the
# stroke analysis above; it should arguably be dropped from the covariates too.
X_propensity = df_encoded.drop(['HadHeartAttack_Yes', 'HadAngina_Yes'], axis=1)
y_propensity = df_encoded['HadAngina_Yes']
logit.fit(X_propensity, y_propensity)
propensity_scores = logit.predict_proba(X_propensity)[:, 1]
df_encoded['propensity_score'] = propensity_scores
treated = df_encoded[df_encoded['HadAngina_Yes'] == 1]
control = df_encoded[df_encoded['HadAngina_Yes'] == 0]
nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control[['propensity_score']])
distances, indices = nn.kneighbors(treated[['propensity_score']])
matched_control = control.iloc[indices.flatten()]
matched_data = pd.concat([treated, matched_control])
treated_effect = matched_data[matched_data['HadAngina_Yes'] == 1]['HadHeartAttack_Yes'].mean()
control_effect = matched_data[matched_data['HadAngina_Yes'] == 0]['HadHeartAttack_Yes'].mean()
print(f"Heart attack incidence rate in the treated group: {treated_effect:.2f}")
print(f"Heart attack incidence rate in the control group: {control_effect:.2f}")
Heart attack incidence rate in the treated group: 0.45 Heart attack incidence rate in the control group: 0.10
/Users/kyusungcho/anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
Y = df_encoded['HadHeartAttack_Yes'].values
D = df_encoded['HadAngina_Yes'].values
X = df_encoded.drop(['HadHeartAttack_Yes', 'HadAngina_Yes'], axis=1).values
causal = CausalModel(Y, D, X)
causal.est_propensity()
causal.est_via_matching()
print(causal.estimates)
Treatment Effect Estimates: Matching
Est. S.e. z P>|z| [95% Conf. int.]
--------------------------------------------------------------------------------
ATE 0.311 0.015 20.751 0.000 0.281 0.340
ATC 0.307 0.016 19.390 0.000 0.276 0.338
ATT 0.364 0.007 54.992 0.000 0.351 0.377
Y = df_encoded['HadHeartAttack_Yes'].values
D = df_encoded['HadStroke_Yes'].values
X = df_encoded.drop(['HadHeartAttack_Yes', 'HadStroke_Yes'], axis=1).values
causal = CausalModel(Y, D, X)
causal.est_propensity()
causal.est_via_matching()
print(causal.estimates)
Treatment Effect Estimates: Matching
Est. S.e. z P>|z| [95% Conf. int.]
--------------------------------------------------------------------------------
ATE 0.054 0.013 4.093 0.000 0.028 0.080
ATC 0.052 0.014 3.782 0.000 0.025 0.079
ATT 0.107 0.006 17.437 0.000 0.095 0.119
df_encoded.shape
(246022, 72)
# from causalinference import CausalModel
# # Outcome variable
# Y = df_encoded['HadHeartAttack_Yes'].values
# for col in df_encoded.columns.drop('HadHeartAttack_Yes'):
#     # Treatment variable for the current iteration
#     D = df_encoded[col].values
#     # Covariates excluding the current treatment variable and the outcome
#     X = df_encoded.drop(['HadHeartAttack_Yes', col], axis=1).values
#     # Initialize and estimate the causal model
#     causal = CausalModel(Y, D, X)
#     causal.est_propensity()
#     causal.est_via_matching()
#     # Print the causal estimates for the current treatment variable
#     print(f"Causal estimates for treatment variable '{col}':")
#     print(causal.estimates)
df_encoded.columns
Index(['PhysicalHealthDays', 'MentalHealthDays', 'SleepHours',
'HeightInMeters', 'BMI', 'Sex_Male', 'GeneralHealth_Fair',
'GeneralHealth_Good', 'GeneralHealth_Poor', 'GeneralHealth_Very good',
'LastCheckupTime_Within past 2 years (1 year but less than 2 years ago)',
'LastCheckupTime_Within past 5 years (2 years but less than 5 years ago)',
'LastCheckupTime_Within past year (anytime less than 12 months ago)',
'PhysicalActivities_Yes', 'RemovedTeeth_6 or more, but not all',
'RemovedTeeth_All', 'RemovedTeeth_None of them', 'HadAngina_Yes',
'HadStroke_Yes', 'HadAsthma_Yes', 'HadSkinCancer_Yes', 'HadCOPD_Yes',
'HadDepressiveDisorder_Yes', 'HadKidneyDisease_Yes', 'HadArthritis_Yes',
'HadDiabetes_No, pre-diabetes or borderline diabetes',
'HadDiabetes_Yes',
'HadDiabetes_Yes, but only during pregnancy (female)',
'DeafOrHardOfHearing_Yes', 'BlindOrVisionDifficulty_Yes',
'DifficultyConcentrating_Yes', 'DifficultyWalking_Yes',
'DifficultyDressingBathing_Yes', 'DifficultyErrands_Yes',
'SmokerStatus_Current smoker - now smokes some days',
'SmokerStatus_Former smoker', 'SmokerStatus_Never smoked',
'ECigaretteUsage_Not at all (right now)',
'ECigaretteUsage_Use them every day',
'ECigaretteUsage_Use them some days', 'ChestScan_Yes',
'RaceEthnicityCategory_Hispanic',
'RaceEthnicityCategory_Multiracial, Non-Hispanic',
'RaceEthnicityCategory_Other race only, Non-Hispanic',
'RaceEthnicityCategory_White only, Non-Hispanic',
'AgeCategory_Age 25 to 29', 'AgeCategory_Age 30 to 34',
'AgeCategory_Age 35 to 39', 'AgeCategory_Age 40 to 44',
'AgeCategory_Age 45 to 49', 'AgeCategory_Age 50 to 54',
'AgeCategory_Age 55 to 59', 'AgeCategory_Age 60 to 64',
'AgeCategory_Age 65 to 69', 'AgeCategory_Age 70 to 74',
'AgeCategory_Age 75 to 79', 'AgeCategory_Age 80 or older',
'AlcoholDrinkers_Yes', 'HIVTesting_Yes', 'FluVaxLast12_Yes',
'PneumoVaxEver_Yes', 'TetanusLast10Tdap_Yes, received Tdap',
'TetanusLast10Tdap_Yes, received tetanus shot but not sure what type',
'TetanusLast10Tdap_Yes, received tetanus shot, but not Tdap',
'HighRiskLastYear_Yes',
'CovidPos_Tested positive using home test without a health professional',
'CovidPos_Yes', 'HadHeartAttack_Yes', 'Location_Northeast',
'Location_South', 'Location_West', 'propensity_score'],
dtype='object')
from sklearn.neighbors import NearestNeighbors

# Estimate propensity scores for angina with a logistic model
logit = LogisticRegression()
X_propensity = df_encoded.drop(['HadHeartAttack_Yes', 'HadAngina_Yes'], axis=1)
y_propensity = df_encoded['HadAngina_Yes']
logit.fit(X_propensity, y_propensity)
propensity_scores = logit.predict_proba(X_propensity)[:, 1]
df_encoded['propensity_score'] = propensity_scores

# 1-nearest-neighbor matching of treated to control on the propensity score
treated = df_encoded[df_encoded['HadAngina_Yes'] == 1]
control = df_encoded[df_encoded['HadAngina_Yes'] == 0]
nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control[['propensity_score']])
distances, indices = nn.kneighbors(treated[['propensity_score']])
matched_control = control.iloc[indices.flatten()]
matched_data = pd.concat([treated, matched_control])
treated_effect = matched_data[matched_data['HadAngina_Yes'] == 1]['HadHeartAttack_Yes'].mean()
control_effect = matched_data[matched_data['HadAngina_Yes'] == 0]['HadHeartAttack_Yes'].mean()
print(f"Heart attack incidence rate in the treated group: {treated_effect:.2f}")
print(f"Heart attack incidence rate in the control group: {control_effect:.2f}")
Heart attack incidence rate in the treated group: 0.45
Heart attack incidence rate in the control group: 0.09
/Users/kyusungcho/anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
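Before trusting the matched comparison above, it is common to check covariate balance between the treated and matched-control groups, typically with standardized mean differences (SMDs). The sketch below is illustrative only: the `smd` helper and the toy covariate series are not part of the analysis; with real data you would pass matched columns such as `treated['BMI']` and `matched_control['BMI']`.

```python
import numpy as np
import pandas as pd

def smd(treated: pd.Series, control: pd.Series) -> float:
    """Standardized mean difference: mean gap divided by the pooled SD."""
    pooled_sd = np.sqrt((treated.var() + control.var()) / 2)
    return float((treated.mean() - control.mean()) / pooled_sd) if pooled_sd > 0 else 0.0

# Toy covariate with a deliberate ~1-SD imbalance between groups
rng = np.random.default_rng(0)
before = smd(pd.Series(rng.normal(30, 4, 500)), pd.Series(rng.normal(26, 4, 500)))
print(f"SMD before matching: {before:.2f}")  # values above ~0.1 typically flag imbalance
```

A common rule of thumb is that good matching should pull all covariate SMDs below about 0.1.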
from causalinference import CausalModel

Y = df_encoded['HadHeartAttack_Yes'].values  # Outcome variable
D = df_encoded['HadAngina_Yes'].values  # Treatment variable
X = df_encoded.drop(['HadHeartAttack_Yes', 'HadAngina_Yes'], axis=1).values  # Covariates
causal = CausalModel(Y, D, X)
causal.est_propensity()
causal.est_via_matching()
print(causal.estimates)
Treatment Effect Estimates: Matching
Est. S.e. z P>|z| [95% Conf. int.]
--------------------------------------------------------------------------------
ATE 0.313 0.015 20.940 0.000 0.283 0.342
ATC 0.309 0.016 19.578 0.000 0.278 0.340
ATT 0.364 0.007 55.012 0.000 0.351 0.377
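The three rows are the average treatment effect over the whole sample (ATE), over the controls (ATC), and over the treated (ATT). With one-to-one matching, the ATT is simply the mean treated-minus-matched-control outcome difference, which is why the naive matched-rate gap above (0.45 − 0.09 = 0.36) lands close to the reported ATT of 0.364. A minimal sketch with made-up binary outcomes for five matched pairs:

```python
import numpy as np

# Toy outcomes for five matched pairs (illustrative values only)
y_treated = np.array([1, 0, 1, 1, 0])
y_matched_control = np.array([0, 0, 1, 0, 0])

# ATT = average within-pair outcome difference
att = float((y_treated - y_matched_control).mean())
print(f"ATT estimate: {att:.2f}")  # → 0.40
```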
X_propensity = df_encoded.drop(['HadHeartAttack_Yes', 'HadDiabetes_Yes'], axis=1)
y_propensity = df_encoded['HadDiabetes_Yes']
logit.fit(X_propensity, y_propensity)
propensity_scores = logit.predict_proba(X_propensity)[:, 1]
df_encoded['propensity_score'] = propensity_scores
treated = df_encoded[df_encoded['HadDiabetes_Yes'] == 1]
control = df_encoded[df_encoded['HadDiabetes_Yes'] == 0]
nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control[['propensity_score']])
distances, indices = nn.kneighbors(treated[['propensity_score']])
matched_control = control.iloc[indices.flatten()]
matched_data = pd.concat([treated, matched_control])
treated_effect = matched_data[matched_data['HadDiabetes_Yes'] == 1]['HadHeartAttack_Yes'].mean()
control_effect = matched_data[matched_data['HadDiabetes_Yes'] == 0]['HadHeartAttack_Yes'].mean()
print(f"Heart attack incidence rate in the treated group: {treated_effect:.2f}")
print(f"Heart attack incidence rate in the control group: {control_effect:.2f}")
Heart attack incidence rate in the treated group: 0.14
Heart attack incidence rate in the control group: 0.11
/Users/kyusungcho/anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
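The same ConvergenceWarning fires on every propensity fit. One possible remedy, following the warning's own advice, is to standardize the features and raise `max_iter`. The sketch below uses scikit-learn's `LogisticRegression` inside a pipeline on synthetic data; note it is sklearn's estimator rather than the `interpret.glassbox` wrapper used in this notebook, so treat it as a sketch under that assumption.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic features on wildly different scales, which slows lbfgs convergence
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

# Scale first, then fit with a higher iteration cap
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)
scores = model.predict_proba(X)[:, 1]  # one propensity-style score per row
print(scores.shape)
```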
Y = df_encoded['HadHeartAttack_Yes'].values # Outcome variable
D = df_encoded['HadDiabetes_Yes'].values # Treatment variable
X = df_encoded.drop(['HadHeartAttack_Yes', 'HadDiabetes_Yes'], axis=1).values # Covariates
causal = CausalModel(Y, D, X)
causal.est_propensity()
causal.est_via_matching()
print(causal.estimates)
Treatment Effect Estimates: Matching
Est. S.e. z P>|z| [95% Conf. int.]
--------------------------------------------------------------------------------
ATE 0.016 0.005 3.027 0.002 0.006 0.026
ATC 0.012 0.006 2.113 0.035 0.001 0.024
ATT 0.037 0.003 12.814 0.000 0.031 0.042
X_propensity = df_encoded.drop(['HadHeartAttack_Yes', 'SmokerStatus_Former smoker'], axis=1)
y_propensity = df_encoded['SmokerStatus_Former smoker']
logit.fit(X_propensity, y_propensity)
propensity_scores = logit.predict_proba(X_propensity)[:, 1]
df_encoded['propensity_score'] = propensity_scores
treated = df_encoded[df_encoded['SmokerStatus_Former smoker'] == 1]
control = df_encoded[df_encoded['SmokerStatus_Former smoker'] == 0]
nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control[['propensity_score']])
distances, indices = nn.kneighbors(treated[['propensity_score']])
matched_control = control.iloc[indices.flatten()]
matched_data = pd.concat([treated, matched_control])
treated_effect = matched_data[matched_data['SmokerStatus_Former smoker'] == 1]['HadHeartAttack_Yes'].mean()
control_effect = matched_data[matched_data['SmokerStatus_Former smoker'] == 0]['HadHeartAttack_Yes'].mean()
print(f"Heart attack incidence rate in the treated group: {treated_effect:.2f}")
print(f"Heart attack incidence rate in the control group: {control_effect:.2f}")
/Users/kyusungcho/anaconda3/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning:
lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
Increase the number of iterations (max_iter) or scale the data as shown in:
https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
Heart attack incidence rate in the treated group: 0.08
Heart attack incidence rate in the control group: 0.10
Y = df_encoded['HadHeartAttack_Yes'].values
D = df_encoded['SmokerStatus_Former smoker'].values
X = df_encoded.drop(['HadHeartAttack_Yes', 'SmokerStatus_Former smoker'], axis=1).values # Covariates
causal = CausalModel(Y, D, X)
causal.est_propensity()
causal.est_via_matching()
print(causal.estimates)
X_propensity = df_encoded.drop(['HadHeartAttack_Yes', 'HadKidneyDisease_Yes'], axis=1)
y_propensity = df_encoded['HadKidneyDisease_Yes']
logit.fit(X_propensity, y_propensity)
propensity_scores = logit.predict_proba(X_propensity)[:, 1]
df_encoded['propensity_score'] = propensity_scores
treated = df_encoded[df_encoded['HadKidneyDisease_Yes'] == 1]
control = df_encoded[df_encoded['HadKidneyDisease_Yes'] == 0]
nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
nn.fit(control[['propensity_score']])
distances, indices = nn.kneighbors(treated[['propensity_score']])
matched_control = control.iloc[indices.flatten()]
matched_data = pd.concat([treated, matched_control])
treated_effect = matched_data[matched_data['HadKidneyDisease_Yes'] == 1]['HadHeartAttack_Yes'].mean()
control_effect = matched_data[matched_data['HadKidneyDisease_Yes'] == 0]['HadHeartAttack_Yes'].mean()
print(f"Heart attack incidence rate in the treated group: {treated_effect:.2f}")
print(f"Heart attack incidence rate in the control group: {control_effect:.2f}")
Y = df_encoded['HadHeartAttack_Yes'].values
D = df_encoded['HadKidneyDisease_Yes'].values
X = df_encoded.drop(['HadHeartAttack_Yes', 'HadKidneyDisease_Yes'], axis=1).values # Covariates
causal = CausalModel(Y, D, X)
causal.est_propensity()
causal.est_via_matching()
print(causal.estimates)
Treatment Effect Estimates: Matching
Est. S.e. z P>|z| [95% Conf. int.]
--------------------------------------------------------------------------------
ATE 0.005 0.010 0.538 0.591 -0.014 0.025
ATC 0.004 0.010 0.404 0.686 -0.016 0.024
ATT 0.030 0.005 5.829 0.000 0.020 0.040
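Collecting the point estimates that `CausalModel` reported above makes the comparison across treatments explicit: prior angina dwarfs diabetes and kidney disease as a predictor of heart attack under matching. The former-smoker estimates table was not shown above, so it is omitted here.

```python
import pandas as pd

# Matching estimates reported by CausalModel in the cells above
estimates = pd.DataFrame(
    {"ATE": [0.313, 0.016, 0.005], "ATT": [0.364, 0.037, 0.030]},
    index=["HadAngina_Yes", "HadDiabetes_Yes", "HadKidneyDisease_Yes"],
)
print(estimates.sort_values("ATE", ascending=False))
```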